EMR Spark Setup Guide

Georgetown University

26 February, 2022

Clone an EMR Cluster

This section will teach you how to re-launch an EMR cluster using previously established settings.

  1. Log into AWS Academy and launch “AWS Lab” to get to the AWS Console. See the AWS Academy Learner Lab Student Guide in the Guides folder if you need a refresher. While you are in AWS Academy, take note of how much credit you have used in your account so far. If you run out of credits, let a member of the instructional team know.

  2. Search in the AWS Console for emr and click on the EMR service as shown in the figure below.

  3. Now that you are in the EMR Dashboard, click on the box for the cluster that you previously made. This may have been a Hue cluster or a Spark cluster. Once you click on that row, there will be a blue box (left yellow arrow). Then click on the Clone button (top yellow arrow).

  4. Click on the blue Clone button. You can leave the steps option as its default since there have been no steps applied to clusters in this class.

Step 4: Security

  • Select your correct EC2 key pair. Use the key pair you created in lab 1, which should be called mykeypair. If you do not select the appropriate key pair, you will not be able to connect to the cluster.

  • Leave Permissions as is.

  • Leave Security Configuration alone.

  • In the EC2 security groups, you need to change your EMR managed security groups to be the security group we created. If you used the same name as the instructions, it will be called open-22. Select this group for both the Master and Core & Task rows.

  • Click the blue Create cluster button to launch your cluster!

Cluster Startup

Once you click on Create Cluster, you will be taken to the cluster summary page where you will see all relevant information about the cluster as shown in the figure below.

The cluster will go through several states until it is ready for you to use.

  1. Starting

  2. Running

  3. Waiting

Cluster startup time can be 5-15 minutes or more! You must wait until the cluster is in Waiting state before you connect to it.

Jump to SSH Section

Start an EMR Cluster

This section will teach you how to launch an EMR (Elastic MapReduce) cluster for PySpark jobs. A cluster means you will have several different machines working together in coordination. There will be one master node that coordinates running the jobs you ask the cluster to do, and worker nodes that execute the work. EMR clusters in industry can have dozens, hundreds, or even thousands of CPU cores! We will be harnessing programming languages that are designed to work on clusters. As a data scientist, you can focus on writing code for data analysis and the cluster will handle setting up the multiple machines for you.

There are several changes when starting this cluster compared to the Hue cluster:

  1. You will be using different applications in step 1
  2. You will add an S3 path in the Edit software settings section of step 1
  3. You will add a bootstrap action in step 3
  4. You will use port 8765 when ssh’ing to the cluster

You should already have an SSH keypair created and security group set up. If you have not accomplished these yet, go back to lab 1 to do so.

It is time to launch an Elastic MapReduce (EMR) Cluster! Follow these steps closely!

  1. Log into AWS Academy and launch “AWS Lab” to get to the AWS Console. See the AWS Academy Learner Lab Student Guide in the Guides folder if you need a refresher. While you are in AWS Academy, take note of how much credit you have used in your account so far. If you run out of credits, let a member of the instructional team know.

  2. Search in the AWS Console for emr and click on the EMR service as shown in the figure below.

  3. Now that you are in the EMR Dashboard, click on the blue Create cluster button as shown with the yellow arrow in the figure below.

  4. Now you will be on the Create Cluster - Quick Options page. You will need to click the Go to advanced options button as shown with the yellow arrow in the figure below.

You will now complete the four step pages to create your own EMR cluster!

Step 1: Software and Steps

This step lets you choose your software and whether you want specific jobs (steps) to run when the cluster starts. In our class we never want to run any steps.

  • Software Configuration section

  • Use EMR version emr-6.1.0 from the Release drop down

  • NOTE THAT THIS SETTING WILL CHANGE BASED ON YOUR NEEDS

    Confirm that the following are checked: Hadoop 3.2.1, Spark 3.0.0

  • Leave the check box in Multiple master nodes (optional) unchecked. This would be useful if you are a cloud architect who wants to set up an EMR cluster that will be running for weeks or months with multiple users.

  • Leave AWS Glue Data Catalog settings (optional) unchecked. This would be useful if you are a data engineer or architect and have built cloud databases, or if you work in an organization with cloud databases.

  • NOTE THAT THIS SETTING CHANGE IS NEW FOR SPARK CLUSTERS

    In the Edit software settings section, select the radio button Load Json from S3 and enter this path into the text box s3://bigdatateaching/bootstrap/cluster-config.json, which will establish several important settings for your Spark cluster.

  • Steps (optional)

  • Leave all of this section alone. We do not have any automated jobs we need to run so we’re not using this at all.

  • Click the blue Next to head to Step 2. The appropriate options are included in the figure below.
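For orientation, EMR software settings are supplied as a JSON list of configuration classifications. The contents of cluster-config.json are not reproduced in this guide, so the sketch below only illustrates the general format. The spark classification and its maximizeResourceAllocation property are real EMR settings, but whether the class file uses them is an assumption, and the names example_config and to_emr_json are made up for this example.

```python
import json

# ASSUMPTION: illustrative content only. The real cluster-config.json
# loaded from S3 in this step may contain different classifications.
example_config = [
    {
        "Classification": "spark",
        "Properties": {"maximizeResourceAllocation": "true"},
    }
]


def to_emr_json(config):
    """Serialize a list of configuration classifications as JSON text."""
    return json.dumps(config, indent=2)


config_text = to_emr_json(example_config)
print(config_text)
```

You do not need to write this file yourself for the class. The point is only to show what kind of document the Load JSON from S3 option expects.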

Step 2: Hardware

In this step you will apply the following settings so that your screen looks like the figure below.

  • Leave the Cluster Composition section alone; we want Uniform instance groups. You could have Instance Fleets if you were a cloud architect building a cluster for a company.

  • Leave all the defaults for the Networking section. These settings would be changed if you were setting up a virtual private cloud (VPC) for a company. It is common for workplaces to have their own VPCs. Do not worry, you will not have to set up any VPCs. Most of the time these are already configured by cloud architects so data scientists can launch EMR clusters as we are doing.

Cluster Nodes and Instances section.

There is a table of instance groups. There are three types: Master, Core, and Task.

  1. The Master instance will coordinate your cluster activities and send jobs to the core and task nodes.
  2. The core nodes will store data for your cluster and compute jobs that are received from the master node.
  3. The task nodes are used for compute jobs and do not store data as part of the distributed file system. In this class, we will not be using task nodes.

  • In the first row of the table, you will keep the instance type as m5.xlarge.

    • If you did want to change the instance type, then you would click on the tiny pencil next to the instance name then scroll through the list of instances to select m5.xlarge and click Save
  • In the second row of the table (core nodes), keep the same instance type so that you are using m5.xlarge. Follow the previous bullet for changing instance type if necessary.

  • Change the instance count to use more core nodes. We will use up to 7 core nodes. There should be 1 master node, 7 core nodes, and 0 task nodes.

    • AWS Academy limits you to using 32 cores at once. Each m5.xlarge node has 4 cores, so assuming you are not running anything else on AWS right now, you can launch 8 nodes in total for your EMR cluster: 1 master node plus 7 core nodes.
  • Leave Cluster scaling unchecked. This option could be useful when you are working and want to get more resources dynamically. But watch out, more resources also means more money!

  • Enable Auto-termination, and leave the idle time at 1 hour and 0 minutes. This will serve as a back up kill switch in case you forget to turn off your EMR cluster. It is always your personal responsibility to turn off your EMR cluster after you are done with your work! You will use all your credits in less than a week if you forget to turn off your EMR cluster.

  • Keep the defaults for EBS root volume.

  • Click the blue Next button at the bottom of the page to go to step 3.
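The node counts above follow from simple arithmetic on the AWS Academy core limit. The sketch below is one way to check it. The function names are made up for this example, and the figures (32 cores, 4 cores per m5.xlarge node, 1 master node) are the ones used in this guide.

```python
ACADEMY_CORE_LIMIT = 32  # core limit stated in this guide
M5_XLARGE_CORES = 4      # cores per m5.xlarge node


def max_nodes(core_limit, cores_per_node):
    """Largest number of identical nodes that fits under the core limit."""
    if core_limit < 0 or cores_per_node <= 0:
        raise ValueError("core_limit must be >= 0 and cores_per_node > 0")
    return core_limit // cores_per_node


def core_node_count(core_limit, cores_per_node, master_nodes=1):
    """Nodes left for the core group after reserving the master node(s)."""
    total = max_nodes(core_limit, cores_per_node)
    if total <= master_nodes:
        raise ValueError("core limit too small for a master plus a core node")
    return total - master_nodes


total_nodes = max_nodes(ACADEMY_CORE_LIMIT, M5_XLARGE_CORES)
core_nodes = core_node_count(ACADEMY_CORE_LIMIT, M5_XLARGE_CORES)
print(total_nodes, core_nodes)  # 8 7
```

If you have other instances running in the same account, subtract their cores from the limit before doing this calculation.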

Step 3: General Cluster Settings

  • General Options section

    • Give the cluster a name that is meaningful. Call the cluster Spark Cluster EMR 6.1.0

    • Leave the other settings here alone. Logging will be useful if your cluster crashes. You can leave the default S3 bucket to store your logs. We do not need to encrypt our logs. If you were working with confidential or classified data, then you should encrypt your logs! Leave Debugging and Termination protection checked.

    • Leave the Tags section alone

  • Open the Additional Options section

    NOTE THAT THERE IS A NEW STEP HERE FOR SPARK CLUSTERS
    • Go to the Add bootstrap action dropdown and select Custom action from the dropdown as shown in the figure below, and click on the Configure and add button.

    • In the Add Bootstrap Action dialog box, enter the following location in the Script location section: s3://bigdatateaching/bootstrap/bigdata-bootstrap_emr6.sh

    Here is a summary of what this script does:

    • Installs Miniconda and Python3 on every node of the cluster, with many additional Python libraries

    • Installs and starts JupyterLab automatically on port 8765, and you can use it for many repositories

    • Installs git

    • Tells YARN to allocate the most possible resources to Spark

    • Make sure you see the custom action on the screen, then click the blue Next button to go to step 4.

Step 4: Security

  • Select your correct EC2 key pair. Use the key pair you created in lab 1, which should be called mykeypair. If you do not select the appropriate key pair, you will not be able to connect to the cluster.

  • Leave Permissions as is.

  • Leave Security Configuration alone.

  • In the EC2 security groups, you need to change your EMR managed security groups to be the security group we created. If you used the same name as the instructions, it will be called open-22. Select this group for both the Master and Core & Task rows.

  • Click the blue Create cluster button to launch your cluster!

Cluster Startup

Once you click on Create Cluster, you will be taken to the cluster summary page where you will see all relevant information about the cluster as shown in the figure below.

The cluster will go through several states until it is ready for you to use.

  1. Starting

  2. Running

  3. Waiting

Cluster startup time can be 5-15 minutes or more! You must wait until the cluster is in Waiting state before you connect to it.
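The wait-for-Waiting rule can also be written as a small polling loop. This is a minimal sketch, not part of the class tooling: get_state is a hypothetical callable that returns the current cluster state as an uppercase string (the form the EMR API reports, for example STARTING or WAITING), and the demo feeds it canned values instead of calling AWS.

```python
import time

READY_STATE = "WAITING"
# States from which the cluster will never become ready.
DEAD_STATES = {"TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_until_ready(get_state, poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Poll get_state() until it returns WAITING.

    Returns the number of polls it took. Raises RuntimeError if the
    cluster is shutting down and TimeoutError if it never becomes ready.
    """
    for attempt in range(1, max_polls + 1):
        state = get_state()
        if state == READY_STATE:
            return attempt
        if state in DEAD_STATES:
            raise RuntimeError(f"cluster entered state {state}")
        sleep(poll_seconds)
    raise TimeoutError(f"cluster not ready after {max_polls} polls")


# Demo with canned states (hypothetical data, no AWS call is made).
fake_states = iter(["STARTING", "RUNNING", "WAITING"])
polls = wait_until_ready(lambda: next(fake_states), sleep=lambda seconds: None)
print(polls)  # 3
```

In practice you can simply refresh the cluster summary page. The loop only makes the rule explicit: do not connect until the state is Waiting, and stop waiting if the cluster starts terminating.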

SSH Into and Use the Cluster

  1. Go to the EMR console and click on the cluster of interest which takes you to the cluster’s summary page

  2. Copy the Master Public DNS from the Summary section. In the figure below, it looks like ecX-XXX-XXX-XXX.compute-1.amazonaws.com. Yours will be different.

  3. Open the same terminal on your local laptop that you used in the terminal setup section. Add your private key to memory using ssh-add (use the right approach based on your operating system).

  4. Run the command ssh -A -L 8765:localhost:8765 hadoop@[[YOUR MASTER NODE DNS ADDRESS]]

USE 8765 FOR THE PORT NUMBER HERE. IT WILL CHANGE FOR A HUE CLUSTER
  • Note the username is hadoop. Get your cluster’s master node public DNS address from the cluster’s summary page.

    • The port number depends on the type of cluster you are using. For a Spark cluster, use 8765. For a Hue cluster, use 8888.
    • All the communications coming from and going to your AWS machine go through the port number that you specify. We use standard ports to make things simple for the applications to communicate with you. Other common port numbers are 443 for secure websites and 80 for non-secure websites.

Your command and login message output will look something like the following:

Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.

Try the new cross-platform PowerShell https://aka.ms/pscore6

PS C:\Users\Monke> ssh-agent
PS C:\Users\Monke> ssh-add
Identity added: C:\Users\Monke/.ssh/id_rsa (C:\Users\Monke/.ssh/id_rsa)
PS C:\Users\Monke> ssh -L 8765:localhost:8765 hadoop@ec2-54-211-241-107.compute-1.amazonaws.com
Last login: Thu Oct 21 03:05:27 2021 from pool-108-31-220-48.washdc.fios.verizon.net

       __|  __|_  )
       _|  (     /   Amazon Linux 2 AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-2/
13 package(s) needed for security, out of 44 available
Run "sudo yum update" to apply all updates.

EEEEEEEEEEEEEEEEEEEE MMMMMMMM           MMMMMMMM RRRRRRRRRRRRRRR
E::::::::::::::::::E M:::::::M         M:::::::M R::::::::::::::R
EE:::::EEEEEEEEE:::E M::::::::M       M::::::::M R:::::RRRRRR:::::R
  E::::E       EEEEE M:::::::::M     M:::::::::M RR::::R      R::::R
  E::::E             M::::::M:::M   M:::M::::::M   R:::R      R::::R
  E:::::EEEEEEEEEE   M:::::M M:::M M:::M M:::::M   R:::RRRRRR:::::R
  E::::::::::::::E   M:::::M  M:::M:::M  M:::::M   R:::::::::::RR
  E:::::EEEEEEEEEE   M:::::M   M:::::M   M:::::M   R:::RRRRRR::::R
  E::::E             M:::::M    M:::M    M:::::M   R:::R      R::::R
  E::::E       EEEEE M:::::M     MMM     M:::::M   R:::R      R::::R
EE:::::EEEEEEEE::::E M:::::M             M:::::M   R:::R      R::::R
E::::::::::::::::::E M:::::M             M:::::M RR::::R      R::::R
EEEEEEEEEEEEEEEEEEEE MMMMMMM             MMMMMMM RRRRRRR      RRRRRR

[hadoop@ip-172-31-83-145 ~]$
  5. Open a browser and navigate to http://localhost:8765 to see your JupyterLab environment. Your environment will look like the figure below.
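The tunnel command from the SSH steps above can be assembled programmatically, which makes the role of each piece easier to see. This is a minimal sketch: build_tunnel_command is a name made up for this example, the defaults (user hadoop, port 8765, agent forwarding with -A) come from this guide, and the DNS name is the one from the sample session, so yours will differ.

```python
import shlex


def build_tunnel_command(master_dns, port=8765, user="hadoop", forward_agent=True):
    """Return the ssh argument list that forwards local `port` to the
    same port on the master node."""
    if not master_dns or master_dns.startswith("-"):
        raise ValueError("master_dns must be a host name")
    if not 1 <= port <= 65535:
        raise ValueError("port must be between 1 and 65535")
    args = ["ssh"]
    if forward_agent:
        args.append("-A")  # forward the key loaded with ssh-add
    # -L local_port:host:remote_port, where host is resolved on the master
    args += ["-L", f"{port}:localhost:{port}", f"{user}@{master_dns}"]
    return args


# DNS name taken from the sample session in this guide; replace with yours.
command = build_tunnel_command("ec2-54-211-241-107.compute-1.amazonaws.com")
print(shlex.join(command))
```

For a Hue cluster, the same function would be called with port=8888. The printed line is what you would paste into your terminal.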

TERMINATE YOUR EMR CLUSTER!

Go to the cluster’s summary page where you grabbed the Master DNS Address. Click on the gray Terminate button.

Then you will get a popup window about shutting down your instance. If the Terminate button is red like in the figure below, click it.

If the button is grayed out like below, you will need to turn off cluster termination protection. To do so, click the Change link.

Then choose the off radio button, then click the green check mark symbol. Next click the red Terminate button.

Once you have successfully terminated the cluster, you will see the yellow terminating text on the cluster’s summary page. That means you can close the page and the cluster will terminate.